
UCL Discovery

Subfamily specific conservation profiles for proteins based on n-gram patterns

Author: F Fogolari
GP Raghava
H Joe
H W.
I Bahar
JC Wootton
JE Coronado
JK Vries
JK Vries
John K Vries
MO Dayhoff
MS Johnson
PC Mahalanobis
QW Dong
R Karchin
RD Finn
S Henikoff
S Henikoff
SF Altschul
SF Altschul
WS Valdar
WS Valdar
WS Valdar
Xiong Liu
Y Hou
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}), which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem in which the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank-ordered by similarity to the query. Results: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two kinds of profile were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. Using the new algorithm, it was possible to construct subfamily-specific conservation profiles for subfamilies with as few as five members. The speed of the new algorithm was comparable to that of the multiple alignment approach. Conclusion: Subfamily-specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.
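
The n-gram patterns described in this abstract can be illustrated with a minimal Python sketch. It only enumerates and counts NP{n,m} patterns; the similarity ranking and singular value decomposition steps are not shown. The function names, the `*` wildcard symbol, and the choice to let wildcards occupy any m positions of the window are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import combinations

WILDCARD = "*"  # hypothetical wildcard symbol, chosen for this sketch


def ngram_patterns(sequence, n, m):
    """Yield patterns of n residues and m wildcards from windows of size n + m.

    Assumption: the m wildcards may fall on any positions of the window.
    """
    size = n + m
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        for wild_positions in combinations(range(size), m):
            chars = list(window)
            for pos in wild_positions:
                chars[pos] = WILDCARD
            yield "".join(chars)


def count_patterns(sequences, n, m):
    """Count every pattern over a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ngram_patterns(seq, n, m))
    return counts


# Toy sequences (not real protein data).
counts = count_patterns(["ACDA", "ACDE"], n=2, m=1)
```

Under the signal-to-noise framing of the abstract, counts taken over targets similar to the query would play the role of the signal, and counts taken over all targets would play the role of the noise.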

MSACompro: protein multiple sequence alignment using predicted secondary structure, solvent accessibility, and residue-residue contacts

Author: A Krogh
AN Tegge
C Notredame
CB Do
DF Feng
DG Higgins
DG Higgins
DG Higgins
DG Higgins
F Jeanmougin
F Wilcoxon
G Pollastri
GH Gonnet
GJ Barton
GP Raghava
GP Raghava
HY Zhou
J Cheng
J Heringa
J Pei
J Pei
J Pei
J Söding
J Söding
JD Thompson
JD Thompson
JD Thompson
JD Thompson
Jianlin Cheng
K Katoh
M Brudno
M Larkin
NK Kim
NS Boutonnet
O Poirot
O Poirot
PHA Sneath
R Chenna
R Durbin
RC Edgar
RC Edgar
RK Bradley
RS Amarendran
RS Amarendran
RS Amarendran
S Chikkagoudar
SE Brenner
SH Sze
T Kawabata
TL Bailey
U Roshan
V Walle
V Walle
Xin Deng
YC Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract. Background: Multiple Sequence Alignment (MSA) is a basic tool for bioinformatics research and analysis. It is used in almost all bioinformatics tasks, such as protein structure modeling, gene and protein function prediction, DNA motif recognition, and phylogenetic analysis. Improving the accuracy of multiple sequence alignment is therefore important for advancing many bioinformatics fields. Results: We designed and developed a new method, MSACompro, to synergistically incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into the currently most accurate posterior probability-based MSA methods to improve the accuracy of multiple sequence alignments. The method differs from multiple sequence alignment methods that use the tertiary structure information of some sequences (e.g. 3D-Coffee), since the structural information in our method is fully predicted from sequences. To the best of our knowledge, applying predicted relative solvent accessibility and contact maps to multiple sequence alignment is novel. Rigorous benchmarking of our method on the standard benchmarks (i.e. BAliBASE, SABmark and OXBENCH) clearly demonstrated that incorporating predicted protein structural information improves multiple sequence alignment accuracy over the leading multiple protein sequence alignment tools that do not use this information, such as MSAProbs, ProbCons, Probalign, T-coffee, MAFFT and MUSCLE. The performance of the method is comparable to that of the state-of-the-art method PROMALS, which uses structural features and additional homologous sequences, with slightly lower scores. Conclusion: MSACompro is an efficient and reliable multiple protein sequence alignment tool that can effectively incorporate predicted protein structural information into multiple sequence alignment. The software is available at http://sysbio.rnet.missouri.edu/multicom_toolbox/.

Public Library of Science (PLOS)

Identification of Mannose Interacting Residues Using Local Composition

Author: A Garg
A Koch
A Malik
A Malik
Anna Tramontano
C Shionyu-Mitsuyama
C Taroni
E Jeong
F Larsen
F Larsen
FA Quiocho
Gajendra P. S. Raghava
GP Raghava
H Kaur
H Kaur
H Nassif
Harinder Singh
HR Ansari
IB Kuznetsov
JS Chauhan
K Julenius
L Sompayrac
LH Bouwman
M Kulharia
M Kumar
M Kumar
M Muraki
M Patra
M Rashid
M Rashid
MM Gromiha
MS Sujatha
N Bhardwaj
Nitish Kumar Mishra
NK Mishra
RA Bauer
S Ahmad
S Hakomori
Sandhya Agarwal
SF Altschul
T Joachims
V Sobolev
VSR Rao
Publication venue: Public Library of Science
Publication date: 01/01/2011

BACKGROUND: Mannose binding proteins (MBPs) play a vital role in several biological functions such as defense mechanisms. These proteins bind to mannose on the surface of a wide range of pathogens and help to eliminate these pathogens from the body. It is therefore important to identify mannose interacting residues (MIRs) in order to understand the mechanism by which MBPs recognize pathogens. RESULTS: This paper describes modules developed for predicting MIRs in a protein. Support vector machine (SVM) based models have been developed on 120 mannose binding protein chains, where no two chains have more than 25% sequence similarity. SVM models were developed on two types of datasets: 1) the main dataset consists of 1029 mannose interacting and 1029 non-interacting residues, 2) the realistic dataset consists of 1029 mannose interacting and 10320 non-interacting residues. In this study, we first developed standard modules using binary and PSSM profiles of patterns and obtained a maximum MCC of around 0.32. Secondly, we developed SVM modules using the composition profile of patterns and achieved a maximum MCC of around 0.74 with an accuracy of 86.64% on the main dataset. Thirdly, we developed a model on the realistic dataset and achieved a maximum MCC of 0.62 with an accuracy of 93.08%. Based on this study, a standalone program and web server have been developed for predicting mannose interacting residues in proteins (http://www.imtech.res.in/raghava/premier/). CONCLUSIONS: Compositional analysis of mannose interacting and non-interacting residues shows that certain types of residues are preferred in mannose interaction. It was also observed that the neighbourhood of mannose interacting residues is enriched in certain types of residues. The composition of patterns/peptides/segments has been used for predicting MIRs and achieved reasonably high accuracy. It is possible that this novel strategy may be effective for predicting other types of interacting residues.
This study will be useful in annotating the function of proteins as well as in understanding the role of mannose in the immune system.
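
The abstract above reports both accuracy and the Matthews correlation coefficient (MCC), and the two move in opposite directions between the balanced main dataset (86.64%, MCC 0.74) and the imbalanced realistic dataset (93.08%, MCC 0.62). A minimal sketch of the standard definitions of both metrics follows. The confusion-matrix counts in the example are hypothetical and are not taken from the paper; returning 0.0 when the MCC denominator is zero is a common convention adopted here as an assumption.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumption: report 0.0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for an imbalanced test set (far more negatives).
example = {"tp": 60, "tn": 900, "fp": 20, "fn": 40}
example_acc = accuracy(**example)
example_mcc = mcc(**example)
```

Because true negatives dominate an imbalanced set, accuracy can stay high even when many positives are missed, which is why MCC is often reported alongside it.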

Relationship between amino acid composition and gene expression in the mouse genome

Abstract. Background: Codon bias refers to differences in the frequencies of synonymous codons among different genes. In many organisms, natural selection is considered to be a cause of codon bias because codon usage in highly expressed genes is biased toward optimal codons. Methods have previously been developed to predict the expression level of genes from their nucleotide sequences, based on the observation that synonymous codon usage shows an overall bias toward a few codons called major codons. However, the relationship between codon bias and gene expression level, as proposed by the translation-selection model, is less evident in mammals. Findings: We investigated the correlations between the expression levels of 1,182 mouse genes and amino acid composition, as well as between gene expression and codon preference. We found that a weak but significant correlation exists between gene expression levels and amino acid composition in mouse. In total, less than 10% of the variation in expression levels is explained by amino acid components. We found that the effect of codon preference on gene expression was weaker than the effect of amino acid composition, because no significant correlations were observed with respect to codon preference. Conclusion: These results suggest that it is difficult to predict expression level from amino acid components or from codon bias in mouse.

Predicting residue-wise contact orders in proteins by support vector regression

Author: A Bairoch
AG Murzin
AR Kinjo
AR Kinjo
AR Kinjo
AR Kinjo
B Rost
CH Tsai
D Kihara
D Sarda
DT Jones
G Pollastri
G Pollastri
GP Raghava
HM Berman
J Song
J Wang
Jiangning Song
JM Chandonia
Kevin Burrage
KW Plaxco
M Punta
MPS Brown
NP Prabhu
S Ahmad
S Hua
S Hua
V Vapnik
V Vapnik
W Kabsch
W Liu
X Wang
Z Yuan
Z Yuan
Z Yuan
Z Yuan
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The residue-wise contact order (RWCO) describes the sequence separations between a residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and could give deep insights into protein sequence-structure relationships. RESULTS: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance: local sequence in the form of PSI-BLAST profiles; local sequence plus amino acid composition; local sequence plus molecular weight; local sequence plus secondary structure predicted by PSIPRED; local sequence plus molecular weight and amino acid composition; local sequence plus molecular weight and predicted secondary structure; and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55 and a root mean square error (RMSE) of 0.82, based on a well-defined dataset of 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition, we could further improve the prediction performance to a CC of 0.57 and an RMSE of 0.79. In addition, adding the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and yielded the best prediction accuracy, with a CC of 0.60 and an RMSE of 0.78, which is at least comparable to the performance of other existing methods. CONCLUSION: The SVR method shows a prediction performance competitive with, or at least comparable to, the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the view that support vector regression is a powerful tool for extracting the protein sequence-structure relationship and for estimating protein structural profiles from amino acid sequences.
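
The two evaluation measures quoted in this abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE) between predicted and observed values, can be written down in a few lines. The sketch below uses the standard textbook definitions; the function names and the toy values are illustrative assumptions, and the abstract does not say whether the authors normalised RWCO values before scoring.

```python
import math


def pearson_cc(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Toy per-residue values (hypothetical, not from the paper).
pred = [0.2, 0.5, 0.9, 1.4]
obs = [0.1, 0.7, 0.8, 1.6]
cc_value = pearson_cc(pred, obs)
rmse_value = rmse(pred, obs)
```

CC measures how well the predicted profile tracks the shape of the observed one, while RMSE measures the size of the absolute deviations, so the two are usually reported together.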

Queensland University of Technology ePrints Archive

University of Queensland eSpace

Protein sequence alignment with family-specific amino acid similarity matrices

Author: A Agrawal
A Prlić
AR Panchenko
B Qian
B Rost
C Notredame
CB Do
CN Cavasotto
G Vogt
GH Gonnet
GP Raghava
I Van Walle
Igor B Kuznetsov
IN Shindyalov
J Pei
J Söding
JD Blake
JD Thompson
JM Sauder
JS Bernardes
K Mizuguchi
L Holm
L Lo Conte
ML Sierk
MO Dayhoff
MS Johnson
RB Vilim
RC Edgar
RC Edgar
RC Edgar
S Henikoff
S Salem
SB Needleman
SE Brenner
SF Altschul
SR Eddy
T Müller
TF Smith
V Ahola
WR Pearson
WR Taylor
Y Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Predicting Class II MHC-Peptide binding: a kernel based approach using similarity scores

Author: AJ Godkin
B Efron
B Schölkopf
B Schölkopf
CK Hattotuwagama
D Haussler
DA Rhodes
Darren R Flower
FR Burden
G Bonomi
GP Raghava
H Kropshofer
H Noguchi
H Noguchi
H Rammensee
H Saigo
IA Doytchinova
J Hammer
J Hammer
J Xia
JC Tong
JD Blake
Jesper Salomon
JP Vert
JW Yewdell
M Bhasin
M Bhasin
M Nielsen
M Xiao YS
MH Wauben
N Murugan
O Karpenko
P Donnes
P Guan
PY Arnold
R Kuang
RR Mallios
RT Carson
S Henikoff
S Kawashima
SF Altschul
T Muller
T Muller
TF Smith
V Brusic
V Brusic
VN Vapnik
W Liu
Y Bengio
Z Dosztanyi
Z Zavala-Ruiz
Z Zavala-Ruiz
ZR Yang
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to which parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel. RESULTS: The kernel approach presented here shows increased prediction accuracy, with a significantly higher number of true positives and negatives on multiple MHC class II alleles, when tested on data sets from MHCPEP [1], MHCBN [2], and MHCBench [3]. Evaluation by cross validation, when segregating binders and non-binders, produced an average A(ROC) of 0.824 for the MHCBench data sets (up from 0.756), and an average A(ROC) of 0.96 for multiple alleles of the MHCPEP database. CONCLUSION: The method improves performance over existing state-of-the-art methods of MHC class II peptide binding prediction by using a custom, knowledge-based representation of peptides. Similarity scores, in contrast to a fixed-length, pocket-specific representation of amino acids, provide a flexible and powerful way of modelling MHC binding, and can easily be applied to other dynamic sequence problems.

Aston Publications Explorer

Oxford University Research Archive

Molecular evidence for increased regulatory conservation during metamorphosis, and against deleterious cascading effects of hybrid breakdown in Drosophila

Abstract. Background: Speculation regarding the importance of changes in gene regulation in determining major phylogenetic patterns continues to accrue, despite a lack of broad-scale comparative studies examining how patterns of gene expression vary during development. Comparative transcriptional profiling of adult interspecific hybrids and their parental species has uncovered widespread divergence of the mechanisms controlling gene regulation, revealing incompatibilities that are masked in comparisons between the pure species. However, this has prompted the suggestion that misexpression in adult hybrids results from the downstream cascading effects of a subset of genes improperly regulated in early development. Results: We sought to determine how gene expression diverges over development, and to test the cascade hypothesis, by profiling expression in males of Drosophila melanogaster, D. sechellia, and D. simulans, as well as the D. simulans (♀) × D. sechellia (♂) male F1 hybrids, at four different developmental time points (3rd instar larval, early pupal, late pupal, and newly-emerged adult). Contrary to the cascade model of misexpression, we find that there is considerable stage-specific autonomy of regulatory breakdown in hybrids, with the larval and adult stages showing significantly more hybrid misexpression than the pupal stage. However, comparisons between pure species indicate that genes expressed during earlier stages of development tend to be more conserved in their level of expression than those expressed during later stages. This suggests that while Von Baer's famous law applies at the level of both nucleotide sequence and expression, it may not necessarily apply to the underlying overall regulatory network, which appears to diverge over the course of ontogeny and which can only be ascertained by combining divergent genomes in species hybrids.
Conclusion: Our results suggest that complex integration of regulatory circuits during morphogenesis may make it more refractory to divergence of the underlying gene regulatory mechanisms, more so than is suggested by the conservation of gene expression levels between species during earlier stages. This provides support for a 'developmental hourglass' model of divergence of gene expression in Drosophila, resulting in a highly conserved pupal stage.

Public Library of Science (PLOS)

NatF Contributes to an Evolutionary Shift in Protein N-Terminal Acetylation and Is Important for Normal Chromosome Segregation

N-terminal acetylation (N-Ac) is a highly abundant eukaryotic protein modification. Proteomics revealed a significant increase in the occurrence of N-Ac from lower to higher eukaryotes, but evidence explaining the underlying molecular mechanism(s) is currently lacking. We first analysed protein N-termini and their acetylation degrees; the results suggest that evolution of substrates is not a major cause of the evolutionary shift in N-Ac. Further, we investigated the presence of putative N-terminal acetyltransferases (NATs) in higher eukaryotes. The purified recombinant human and Drosophila homologues of a novel NAT candidate were subjected to in vitro peptide library acetylation assays. This provided evidence for its NAT activity targeting Met-Lys- and other Met-starting protein N-termini, and the enzyme was termed Naa60p and its activity NatF. Its in vivo activity was investigated by ectopically expressing human Naa60p in yeast followed by N-terminal COFRADIC analyses. hNaa60p acetylated distinct Met-starting yeast protein N-termini and increased general acetylation levels, thereby altering yeast in vivo acetylation patterns towards those of higher eukaryotes. Further, its activity in human cells was verified by overexpression and knockdown of hNAA60 followed by N-terminal COFRADIC. NatF's cellular impact was demonstrated in Drosophila cells, where NAA60 knockdown induced chromosomal segregation defects. In summary, our study revealed a novel major protein modifier contributing to the evolution of N-Ac, redundancy among NATs, and an essential regulator of normal chromosome segregation. With the characterization of NatF, the co-translational N-Ac machinery appears complete, since all the major substrate groups in eukaryotes are accounted for.

University of Bergen

Ghent University Academic Bibliography